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Abstract 

Based on comparative phylogenetic analysis of 16S rRNA gene sequences deposited In an RDP database, we constructed a 
local database of thaumarchaeotal 16S rRNA gene sequences and developed a novel PGR primer specific for the archaeal 
phylum Thaumarchaeota. Among 9,727 quality-filtered (chimeral-checked, size >1.2 kb) archaeal sequences downloaded 
from the RDP database, 1,549 thaumarchaeotal sequences were identified and included in our local database. In our study, 
Thaumarchaeota included archaeal groups MG-I, SAGMCG-I, SCG, FSCG, RC, and HWCG-III, forming a monophyletic group in 
the phylogenetic tree. Cluster analysis revealed 1 14 phylotypes for Thaumarchaeota. The majority of the phylotypes (66.7%) 
belonged to the MG-I and SCG, which together contained most (93.9%) of the thaumarchaeotal sequences in our local 
database. A phylum-directed primer was designed from a consensus sequence of the phylotype sequences, and the 
primer's specificity was evaluated for coverage and tolerance both in silico and empirically. The phylum-directed primer, 
designated THAUM-494, showed >90% coverage for Thaumarchaeota and <1% tolerance to non-target taxa, indicating 
high specificity. To validate this result experimentally, PCRs were performed with THAUM-494 in combination with a 
universal archaeal primer {ARC917R or 1017FAR) and DNAs from five environmental samples to construct clone libraries. 
THAUM-494 showed a satisfactory specificity in empirical studies, as expected from the in silico results. Phylogenetic 
analysis of 859 cloned sequences obtained from 10 clone libraries revealed that >95% of the amplified sequences belonged 
to Thaumarchaeota. The most frequently sampled thaumarchaeotal subgroups in our samples were SCG, MG-I, and 
SAGMCG-I. To our knowledge, THAUM-494 is the first phylum-level primer for Thaumarchaeota. Furthermore, the high 
coverage and low tolerance of THAUM-494 will make it a potentially valuable tool in understanding the phylogenetic 
diversity and ecological niche of Thaumarchaeota. 
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Introduction 

The archaeal phylum Thaumarchaeota was proposed in 2008, 
distinguishing mesophilic ammonia-oxidizing archaeal (AOA) 
lineages from hyperthermophiKc Crmarchaeota lineages [1]. This 
proposal was based on archaeal phylogeny inferred from rRNA 
and ribosomal protein sequences, which suggested that mesophilic 
Crenarchaeota constitute a distinct phylum that branches oflF near the 
root of Archaea. A few years later, this distinction was confirmed by 
genomic information (e.g., the identification of Thaumarchaeota- 
specific genes) in Cenarchaeum sjmbiosum, Mitrosopumilus maritimus, and 
Mtrososphaera gargensis, representatives of marine and terrestrial 
AOA lineages [2] . 

The discovery of Thaumarchaeota sparked interest not only in the 
field of microbial ecology, but also in the fields of evolution, 
physiology, and molecular biology of the domain Archaea, since the 
majority of its members described so far are mesophilic ammonia 
oxidizers [3] . In the past, the Archaea were thought to be confined 
to extreme environments [4] ; consequendy, their ecological role in 
global geochemical cycling was underestimated. However, a 



number of molecular ecological studies have revealed that the 
Archaea inhabit a wide variety of moderate environments [5], 
suggesting that they play a substantial role in global geochemical 
cycling. Based on 1 6S rRNA gene surveys, the Thaumarchaeota have 
been estimated to represent up to 20% and 5% of aU prokaryotes 
in marine and terrestrial environments, respectively [6-8]. 
Another notable feature of the Thaumarchaeota is that all cultured 
or enriched members of this phylum are ammonia oxidizers [9- 
16]. Before AOA were discovered, ammonia oxidation was 
thought to be performed exclusively by ammonia-oxidizing 
bacteria (AOB) in the bacterial phylum Proteohacteria [17]. Initial 
evidence supporting archaeal ammonia oxidation included the 
discovery of archaeal homologs of bacterial ammonia monoox- 
igenase genes [amoA and amoB) in metagenomes [18,19]. Later, 
additional studies concluded that amo^-carrying archaea are AOA, 
and suggested that AOA could contribute significantly to the 
global nitrogen cycle [9,10,12,20-23]. Recent studies have also 
shown that the copy numbers of archaeal amoA are much higher 
than the copy numbers of bacterial amoA in many soil samples [24— 
29], indicating the predominance of AOA over AOB. 
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Figure 1. Schematic diagram showing the overall process to construct local database and to design primers. Bold letters indicate 
sequence sets (or subsets). Open cross symbols and dashed lines indicate sequence merge points and repeating sub-routines, respectively. 
doi:1 0.1 371 /journal.pone.00961 97.g001 



Since AOA are highly fastidious organisms (only a few 
laboratories have successfully isolated or enriched AOA from the 
environment), most ecological studies of AOA depend on PCR- 
based molecular methods. Hence, PCR primer specificity inevi- 
tably affects analysis and interpretation. However, PCR primers 
specific for AOA or Thaumarchaeota have not been well established. 
Moreover, all 16S rRNA primers previously used to quantify AOA 



targeted a single thaumarchaeotal subgroup [28,30-37]. Almost 
all molecular ecological studies of AOA or the Thaumarchaeota 
assumed that marine group 1.1a AOA (hereafter referred to as 
MG-I) and soil group 1.1b AOA (hereafter referred to as SCO) 
predominated in marine and terrestrial samples, respectively. 
Thus, most of these studies employed primers specific to one of 
these subgroups for estimating the abundance of AOA or 
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Table 1. Summary of thaumarchaeotal 16S rRNA gene sequences used for phylogenetic analysis and primer design. 





Taxa" 


RDP database'' 


Local database 






No. of total sequences 


No. of c|uality- 

filtered*^ sequences No. of sequences 


Mean size (base) ± SD** 


Crenarchaeota 


13,316 


2,616 872 


1,373.7±89.0 


Tbermoprotei 


13,316 (11,406)^ 


2,616 (1,837)' 872 (93)" 


1,3 76.3 ±88.5 


Euryarchaeota 


35,985 


6,072 6,072 


1,366.6±80.7 


Archaeoglobi 


415 


70 70 


1,408.8±65.7 


Halobacteria 


4,576 


1,377 1,377 


1,411.6±60.9 


Methanobacteria 


5,222 


728 728 


1,302.9±71.5 


Methanococci 


474 


126 126 


1,398.4±65.6 


Methanomicrobia 


14,270 


2,058 2,058 


1,377.9±59.8 


Methanopyri 


18 


5 5 


1,400.6± 100.2 


Tbermococci 


670 


229 229 


1,426.4 ±94.5 


Tbermoplasmata 


1,615 


437 437 


1,326.2±74.8 


Unclassified Euryarchaeota 


8,725 


1,042 1,042 


1,326.1 ±90.7 


Korarchaeota 


213 


88 88 


1,306.0±75.2 


Nar)oarchaeota 


54 


3 3 


1,473.3±23.1 


Thaumarchaeota 


_f 


1,549 


1,376.1 ±54.3 


FSCQS 




17 


1,384.1 ±56.1 


HWCG-II|3 




25 


1,342.4±49.3 


MG-|3 




1,100 


1,373.1 ±56.4 






7 


1,372.1 ±57.9 


SAGMCG-I^ 




40 


1,359.8±43.4 


SCG^ 




355 


1,389.5±45.1 


UT-l" 




2 


1,405.5±6.4 


UT-ll" 


_ 


1 


1,232'' 


UT-lll'' 




2 


1,333.0±5.7 


dsag' 




326 


1,293.9±77.2 


THSCG' 




50 


1,340.6±44.3 


MCG' 




483 


1,342.8±91.0 


ug' 




12 


1,345.4±84.9 


Unclassified Archaea' 


1 2,509 


948 272 


1,330.3±75.0 


Total 


62,077 


9,727 9,727 


1,363.4±79.8 


"Phylum- and class-level archaeal groups. 
''RDP database release 10.22. 

'^Filtered using quality check option {http://rdp.cme.msu.edu). 
^Standard deviation. 

"^Sequences belong to unclassified Tbermoprotei. 
'Not shown in RDP database. 

^Subgroups in Tbaumarcbaeota {sequences closely related to MG-I). FSCG (forest soil crenarchaeotic group); HWCG-III (hot water crenarchaeotic group III); MG-I (marine 
group 1); RC (rice cluster); SAGMCG-I (South Africa gold mine crenarchaeotic group 1); SCG (soil crenarchaeotic group). 
'^Unclassified Tbaumarcbaeota. Unclassified thaumarchaeotal subgroups found in this study. 

'Archaeal groups distantly related to the phylum Tbaumarcbaeota. DSAG (deep sea archaeotic group); THSCG (terrestrial hot spring crenarchaeotic group); MCG 
(miscellaneous crenarchaeotic group); UG (unclassified group). 



^As shown in RDP database. 
^Standard deviation not available. 
doi:1 0.1 371 /journal.pone.00961 97.t001 



Thaumarchaeota. Although different thaumarchaeotal subgroups 
have been hypothesized to have different niches [28,38,39], 
samples could potentially harbor an unexpected AOA subgroup 
(e.g., subgroup MG-I in soil samples, or subgroup SCG in marine 
samples), as observed by Tourna et al. [37] and Beman and 
Francis [40]; moreover, samples could also harbor multiple 
subgroups or even as-yet-undiscovered subgroups. In such 
samples, the abundance and diversity of AOA or Tliaumarchaeota 



could be drastically underestimated. We attributed the lack of 
phylum-level primers {Thaumarchaeota-dvcected primers) to the 
ambiguously defined phylogenetic range of Thaumarchaeota and 
the limited number of thaumarchaeotal 16S rRNA sequences 
available at the time of primer design. However, the current 
availability of a large sequence database has facilitated the timely 
design of PCR primers covering the entire phylogenetic range of 
Thaumarchaeota. Such primers will contribute to our understanding 
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of the ecological roles, distribution patterns, and environmental 
factors shaping the niche of this phylum. 

In this study, we constructed a local database of thaumarch- 
aeotal 16S rRNA gene sequences by comparatively analyzing all 
16S rRNA gene sequences deposited in an RDP database, and 
developed a phylum-directed primer for Thaumarchaeota. The 
specificity of the designed primer was assessed by comparing its 
performance to those of existing subgroup-directed primers. To 
the best of our knowledge, this is the first study to comprehensively 
analyze the thaumarchaeotal 16S rRNA gene sequences with the 
most current database, and to design a Thaumarchaeota-directed 
primer. Herein, we describe the phylogenetic diversity and 
breadth of the phylum Thaumarchaeota and the specificity of the 
newly designed PCR primer. 

Materials and Methods 

Construction of Local Database 

Thaumarchaeotal 16S rRNA gene sequences were collected 
from the RDP database (release 10.22, 2010, http:/ /rdp.cme.msu. 
edu) [41]. Because no sequences were found under the database 
category, "phylum Thaumarchaeota" in the RDP database, we first 
identified Thaumarchaeota-rehAtd sequences throughout the RDP 
database using an iterative phylogenetic approach (Fig. 1). 

Using the selection criteria employed by the RDP website, 
chimera check (pass) and length (near full-length [>1.2 kb]) of 
sequences, a total of 9,727 sequences out of 62,077 archaeal 
sequences were downloaded from the RDP database to construct 
our local database (Table 1). Prior to the iteration routine for 
searching and collecting the Jhaumarchaeota-idaXed sequences from 
the downloaded sequences, a backbone phylogenetic tree (Fig. SI) 
was constructed with representative sequences of archaeal taxa 
(see below for the phylogenetic reconstruction). The sequence set 
for the backbone tree, B = {backbone sequences}, included 106 
sequences from the group MG-I as key members of the phylum 
Thaumarchaeota, which were selected based on published literature 
[12,42^9], and 72 genus-level representative sequences (type 
strains and genome-sequenced strains) from the phyla Euryarch- 
aeota, Crenarchaeota, Manoarchaeota, and Korarchaeota (Table. SI). 

In the first round of the iteration routine {i= 1 and j= 1), a 
subset of downloaded RDP sequences, d„ was added to the 
sequence set B, resulting in a combined sequence set, C, = B+d,. 
Then a subset of MG-I-related sequences, m^,, in the sequence set 
d, (mj,Cd,), was identified from a phylogenetic tree constructed 
with the sequence set C, (see below for the phylogenetic 
reconstruction). The input sequences that formed a tight 
(bootstrap score >80%) monophyletic cluster with the MG-I 
sequences that were included in the sequence set B were regarded 
as 'MG-Trelated sequences'. In the following rounds (/ =j+l), MG- 
I-related sequences were searched repeatedly in a phylogenetic 
tree constructed with a subtracted sequence set, ?>i(j+i) = S,j—m,j 
(S^ = C,— mj, if 7 = 1) until no more MG-I-related sequences were 
found in the sequence set S,ij+iy After collecting MG-I-related 
sequences (M, = ^ m^^) from the sequence subset d„ the 
monophyly of the sequence set M, was evaluated again with the 
bootstrap score, then the iteration routine was performed for the 
remaining subsets of the RDP sequences, d,+i. Finally, the 
downloaded RDP sequences that formed a monophyletic cluster 
(^,M,) were regarded as thaumarchaeotal sequences (T) in our 
local database. Our local database sequences are available at 
http://sdrv.ms/lk8frAc with RDP's structure-based alignment 
format. 

The final version of phylogenetic tree was constructed with the 
thaumarchaeotal sequences (T) identified during the iteration 




Figure 2. Phylogenetic positions of archaeal sequences in our 
local database. The phylogenetic distances of each sequence were 
calculated using the Jukes-Cantor model, and the tree was constructed 
using the neighbor-joining (NJ) algorithm. The numbers at the nodes 
indicates the bootstrap score (as a percentage) and are shown for the 
frequencies at or above the threshold of 50%. Open circles indicate the 
bootstrap score >50% estimated using the randomized accelerated 
maximum likelihood (RAxML) algorithm (GTR+CAT approximation). 
Arrow indicates an internal node corresponding to the phylum 
Thaumarchaeota. The scale bar represents the expected number of 
substitutions per nucleotide position. 
doi:10.1371/journal.pone.0096197.g002 

routine. Backbone sequences (B) were included in the phylogenetic 
tree to show the phylogenetic position of the phylum Thaumarch- 
aeota. Some sequences distantly related to the group MG-I were 
also included in the final version of the phylogenetic tree for later 
reference. 
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Table 3. Number of phylotypes and intra-group genetic distances of archaeal groups Included in the local database. 



Taxa 


Intra-group genetic distance^ (mean ± SD^ 


No. of phylotype 


Average no. of sequences per phylotype 


O* 6 nor ch ciGOto 


U. 1 DZ_U.UjU 


c 




Euryorchocoto 


_U.U/D 






Kororchdcoto 


u.Uisy 






Nonoorchocoto 


IN A 








U.UOD — U.UDy 








U. 1 ^0 — \J.\J^D 








n nQi +n n^n 
u.uy 1 —U.UoU 






UG 


U. 1 Z'f — U.UjD 






ThoumorchoGotci 


U. 1 D J —U.UD^ 


114 


14 3 


FSCG 


U. 1 UiS — U.UZo 


1 0 


1 7 


HWCG-III 


0.097±0.027 


11 


2.3 


MG-I 


0.083 ±0.022 


49 


25.0 


RC 


0.082+0.033 


3 


2.3 


SAGMCG-I 


0.074±0.022 


9 


4.3 


SCG 


0.078±0.021 


27 


12.5 


UT-I 


0.074^^ 


2 


1.0 


UT-II 


NA 


1 


1.0 


UT-I II 


0.048"^ 


2 


1.0 



^Intra-group genetic distances for Crenarchaeota, Euryarchaeota, Korarchaeota, and Nanoarchaeota were estimated from the bacl<bone sequences. 

"^Standard deviation. 

'^Not determined. 

^Standard deviation not available. 

^Not available due to the insufficient number of sequences. 
doi:l 0.1 371/journal.pone.00961 97.t003 



Estimating Genetic Distances and Rarefaction Analysis 

Pair-wise genetic distances between sequences in the local 
database were measured witli MEGA [50] using the Jukes-Cantor 
(JC) model [51] and were subjected to UPGMA (unweighted pair 
group method with arithmetic mean) cluster analysis implemented 
in MEGA. Thaumarchaeotal phylotypes were defined by a 
cophenetic distance of 0.2 (sequence similarity >98%) in the 
cluster analysis, which roughly corresponded to species-level 16S 
rRNA gene similarity [52]. Intra- and inter-subgroup genetic 
distances were estimated from the phylotype sequences of the 
archaeal groups listed in Table 1. For the phyla Crenarchaeota, 
Euryarchaeota, Korarchaeota, and Nanoarchaeota, only the genus-level 
representative sequences included in the backbone sequence set (B) 
were used for the estimation of the genetic distances due to the 
calculation load. 

Rarefaction analysis was performed to estimate the phylotype 
richness of the phylum Thaumarchaeota. Phylotypes were defined as 
operational taxonomic units (OTUs), and the OTU richness 
estimators {S„i,^ for the phylum Thaumarchaeota and each of its 
subgroups were calculated using Estimates [53]. 

Phylogenetic Reconstruction 

A multiple alignment of the phylotype sequences determined by 
the cluster analysis was subjected to phylogenetic reconstruction. A 
bacterial sequence (GenBank accession no. JO 1695) was used as an 
outgroup, and sequences belonging to archaeal phyla other than 
the phylum Thaumarchaeota were also included in the phylogenetic 
analysis. Phylogenetic trees were inferred using the neighbor 
joining (NJ) algorithm implemented in the MEGA [50]. The tree 
topology was statistically evaluated by 100 bootstrap re-samplings 
and was further confirmed using the maximum likelihood (ML) 



algorithm (GTR+CAT approximation) implemented in the 
RAxML [54]. 

Local Bayesian Classifier 

A local naive Bayesian classifier was built with our local 
database to serve as a training set of sequences. The algorithm 
previously established by Wang et al [55], which is currentiy 
implemented in RDP's classifier, was used to estimate the 
probability that a query sequence, S, is a member of phylum D, 
ii;D I S) = ii;S I D) ■kP(D)/P{S], where P{p) is the prior probability of 
a sequence being a member of phylum D and i^S) is the overall 
probability of finding sequence S in any phyla. The joint 
probability of observing the sequence S, which contains a set of 
words (subsequences, Vi), was estimated as P{S | D) = 11 Piy; \ D). 
Word-specific priors were calculated with the 8-base subsequenc- 
es, and the priors P{p) and PiS) were assumed to be constant terms 
according to the original paper [55]. The query sequences that 
gave the highest probability were classified as being members of 
the phylum D. 

Primer Design 

A phylum-directed primer for the Thaumarchaeota was designed 
with a thaumarchaeotal consensus sequence using Primrose [56]. 
The consensus sequence (majority rule) for the phylum Thau- 
marchaeota was obtained from the thaumarchaeotal phylotypes in 
the local database. We permitted no degenerate site in the primer 
sequences. The specificity of the designed primer (or primer pairs 
used) was evaluated using Ohgocheck (www.cf.ac.uk/biosi/ 
research/biosoft/) with local database sequences and determined 
both by the extent to which the primer binds to target group 
sequences (coverage) and non-target sequences (tolerance). The 
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Figure 3. Rarefaction curves for the phylum Thaumarchaeota and its subgroups. Phylotypes were defined as operational taxonomic unit 
(OTU). (A) Phylum Thaumarchaeota and subgroups MG-I and SCG. (B) Subgroups FSCG, HWCG-III, RC, and SAGiVlCG-l. 
doi:1 0.1 371 /journal.pone.00961 97.g003 



tolerance of the primer to domain Bacteria was evaluated using the 
ProbeMatch program implemented in the RDP website. The 
thermodynamic properties (e.g., free energy, AG, predictions) for 
self-complementary structures (hair-pin and primer-dimer) were 
determined using NetPrimer (http://www.premierbiosoft.com/ 
netprimer/). 



PCR Amplification 

Soil samples were collected from Kellerberrin in Western 
Australia, HUlgate in southern California, La Campana in central 
Chile and Incheon in Korea, and a marine sediment sample was 
collected from the continental shelf of the Yellow Sea, west of Jeju 
Island, Korea. Details of the sampling locations and the 
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physicochemical properties of the samples were described 
elsewhere [57,58]. Community DNAs were directly extracted 
from the samples using the MoBio PowerSoil DNA extraction kit 

(MoBio Laboratories, Solana Beach, Calif, USA) according to the 
manufacturer's protocol and were individually used as PCR 
templates. 

The phylum-directed primer designed in this study (primer 
THAUM-494) was used with the universal primer ARC917R or 
1017FAR [59,60]. The reaction mixture included 25 jil of 
TaKaRa Ex Taq premix (Takara, Shiga, Japan), 1 |a.l of each 
forward and reverse primer (stock concentration, 20 |J,M), 200 ng 
of template DNA extracted from the soil sample, and sterilized 
distilled water to give a 50 |J,1 final volume. The PCR thermal 
profile was as follows: initial denaturation at 95°C for 5 min, 
followed by 30 cycles consisting of denaturation at 95°C for 30 s, 
primer annealing at 55°C for 30 s, and extension at 72°C for 30 s. 
The final elongation step was extended to 20 min. PCR 
amplification was performed with a GeneAmp PCR system 
9700 (Apphed Biosystems, Foster City, Calif., USA). Positive PCR 
ampKcons were confirmed using agarose gel electrophoresis. 

Cloning 

The PCR amplicons were purified using a QIAquick PCR 
purification kit (Qiagen, Valencia, Calif, USA) and cloned into 
TOPO cloning vectors with a TOPO TA cloning kit (Invitrogen, 
Carlsbad, Calif, USA) to construct the clone libraries according to 
the manufacturer's protocol. Insert sequences of 859 clones, which 
were randomly selected from 10 clone libraries (ca. 90 clones per 
primer pairs used and 180 clones per sample) were sequenced 
using an ABI3700 DNA analyzer (Applied Biosystems, Foster 
City, Calif, USA). The phylogenetic alfiliations of the cloned 
sequences were determined using phylogenetic reconstruction and 
were further confirmed by our local Bayesian classifier. When the 
local Bayesian classifier applied, the cloned sequences were 
classified as being members of the phylum D or its subgroup O 
that gave the highest probability, P(D \ S) or P{0 \ S), where S was a 
cloned sequence as a query. 

Nucleotide Sequence Accession Numbers 

All sequences produced in this study have been deposited in 
GenBank under the accession numbers KF275675 to KF276604. 

Results and Discussion 

Overview of Thaumarchaeotal Sequences in Public 
Databases 

To design taxon-directed primers (or probes), the entire 
taxonomic range of the target taxon should be clearly defined so 
that designers can easily distinguish sequences belonging to non- 
target taxa from those belonging to the target taxon. While 
prokaryote taxonomy should not be solely deduced from 16S 
rRNA gene sequences, a priori knowledge of the phylogenetic 
breadth of the target taxon, which is partially reflected in the 16S 
rRNA-based phylogenetic tree, is a prerequisite for developing 
primers that specifically bind to complementary regions of target 
16S rRNA gene sequences. However, our knowledge regarding 
the phylogenetic range of Thaumanhaeota is rather limited, with the 
exceptions of certain subgroups (MG-I, SCG, andHWCG-III [hot 
water crenarchaeotic group III]) used to define this phylum [1]. A 
large number of archaeal 16S rRNA gene sequences have been 
deposited into public databases (e.g., 128,378 and 38,641 in RDP 
[release 10.32, 201,3] and SILVA [release 114, 2013] databases, 
respectively). However, only two studies, both of which compre- 
hensively analyzed archaeal 16S rRNA sequences that are either 



closely or distantly related to Thaumanhaeota, have attempted to 
define the phylogenetic range of Thaumanhaeota [6 1 ,62] . These two 
studies both classified the archaeal groups MG-I, SCG, 
SAGMCG-I (South Africa gold mine crenarchaeotic group I), 
and HWCG-III within the phylum Thaumanhaeota. However, these 
two studies reached contradictory conclusions regarding classifi- 
cation of the other subgroups. 

The SILVA database (release 114, 2013, http://arb-silva.de) 
[63] assigns 15,773 16S rRNA gene sequences to the phylum 
Thauinnrclmeota, and divides Thaiimarclmeota into 27 subgroups. 
However, this classification is mainly based on the phylogenetic 
assignment of the 16S rRNA gene sequences according to 
literature, with the phylogenetic positions of those sequences 
determined by SILVA's workflow (personal communication). To 
date, no supporting materials have been published regarding 
SILVA's classification scheme for thaumarchaeotal 16S rRNA 
gene sequences. For example, the SILVA database classifies 16S 
rRNA gene sequences of MCG (miscellaneous crenarchaeotic 
group, previously named TMCG, where "T" stands for "terres- 
trial"), DSAG (deep sea archaeal group, alternatively named 
MBG-B, marine benthic group B), and HWCG-I (hot water 
crenarchaeotic group I) into Thaumanhaeota. However, recent 
studies have concluded that rRNA-based phyl()geni(;s are insulfi- 
cient for determining the relationship of these three archaeal 
groups to Thaumanhaeota, proposing that more data are required to 
accurately define their taxonomic positions [62,64]. The situation 
is more complicated with other 16S rRNA gene-oriented 
databases. For example, the RDP database does not retrieve any 
16S rRNA gene sequences when the Thaumanhaeota taxonomic 
categories used. Moreover, the Greengenes database (2013 
version, http://greengenes.lbl.gov) [65] does not even index the 
phylum Thaumanhaeota. Due to the lack of robust phylogenetic 
affiliations and the difficulties in accessing Thaumanhaeota-rdateA 
sequences in current public databases, we constructed our own 
local database of thaumarchaeotal 16S rRNA gene sccjuences to 
facilitate the development of TTiaumanhaeota-direcied primers. 

Collection of Thaumarchaeota-re\ated Sequences 

We initially constructed a backbone phylogenetic tree (Fig. SI) 
for the domain Anhaea. This tree was built with 106 MG-I 
sequences, selected from the published literature, and 72 
representative sequences from well-defined taxa in the archaeal 
phyla Crenarchaeota, Euryanhaeota, Korarchaeota, and Nanoanhaeota 
(Table. SI). Since MG-I is a basal constituent of the phylum 
Thaumanhaeota [1], the only thaumarchaeotal sequences initially 
included in the backbone phylogenetic tree were these 106 MG-I 
sequences. Thaumanhaeota sequences were limited in this way to 
avoid phylogenetic inferences for other thaumarchaeotal sub- 
groups at the initial stages of sequence collection, as well as to 
define the minimum phylogenetic range of Thaumanhaeota. In this 
backbone tree, the MG-I sequences formed a tight monophyletic 
cluster (bootstrap score, 100%), comprising a lineage distinct from 
Crenanhaeota, Euryanhaeota, Korarchaeota, and Nanoarchaeota. While 
Crenarchaeota was a monophyletic group (bootstrap score, 98%), 
Nanoarchaeota and Korarchaeota were not-well-resolved in this tree. In 
addition, Euryarchaeota was split into four independent clusters and 
appeared to be paraphyletic, as previously reported [66,67]. 
However, further attempts to clarify the phylogenetic positions of 
these unresolved phyla were left for future studies, since 
phylogenetic inference for archaeal phyla other than Thaumarch- 
aeota was considered beyond the scope of this study. In subsequent 
iteration steps for collecting thaumarchaeotal sequences (Fig. 1), 
16S rRNA gene sequences belonging to archaeal phyla other than 
Thaumarchaeota were considered to be non-target (non-thaumarch- 
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aeotal) sequences. All sequences forming a tight monophyletic 
cluster (bootstrap score >80%) with MG-I sequences were 
collected and classified as thaumarchaeotal. 

During iteration routine performed with the 178 sequences 
included in the backbone tree and the 9,727 RDP quality-filtered 
archaeal sequences downloaded from the RDP database, 272 
RDP sequences could not be assigned to any archaeal group in the 
backbone tree due to their unstable phylogenetic positions near the 
root of the domain Archaea. These sequences were assigned to 
'unclassified Archaea' in our local database. With the exceptions of 
the RDP sequences that had been properly classified as 
Euiyarchaeota, Nanoarchmota, or Korarchaeota at the phylum-level in 
the RDP database (n = 6,163), we found that 2,419 RDP 
sequences (hereafter referred to as U-RDP sequences) formed 
clusters distinct from Crenarcfmeota, Euiyarchaeota, Manoarchaeota, and 
Korarchaeota. All of these U-RDP sequences were derived from 
RDP taxa designated as either 'unclassified Thermoprotei' in the 
phylum Crenarchaeota (n = 1,596) or 'unclassified Archaea' (n = 823) 
(Table 1). A neighbor joining (NJ) phylogenetic tree was 
constructed with these U-RDP sequences and backbone sequences 
(Fig. 2); in this tree, 1,549 U-RDP sequences (hcr(^after referred to 
as T-RDP sequences) formed a tight monophyletic cluster with 
MG-I sequences, which was supported by a high bootstrap score 
(92%). The monophyly of the T-RDP cluster was also confirmed 
in a maximum likelihood (ML) tree, with an associated bootstrap 
score of 95%. We assigned this monophyletic group to the phylum 
Thaumarchaeota, containing nine subgroups in this study (Table 1; 
Tal)l(- S2). Among these subgroups, six corresponded to previously 
recognized archaeal groups: MG-I, SAGMCG-I, SCG, FSCG 
(forest soil crenarchaeotic group), RC (rice cluster), and HWCG- 
III. The remaining subgroups, comprised of five sequences 
(AB050231, ABl 13628, EF444677, HM187528, and 
HM187541), were assigned to groups designated as UT-1, -2, 
and -3 (UT, unclassified Thaumarchaeota) in our local database, 
since their previous designations were unclear or unavailable. Two 
UT-1 sequences, AB050231 and ABl 13628, had previously been 
reported as SAGMCG-II sequences in previous studies [48,68], 
whereas the other U-RDP sequences belonging to SAGMCG-II 
were merged into the MG-I cluster in our phylogenetic tree. 
Among the six subgroups corresponding to known archaeal 
groups, five subgroups (MG-I, SCG, FSCG, RC, and HWCG-III) 
appeared to be monophyletic (bootstrap scores >50%). 

The remaining U-RDP sequences (hereafter referred to as N- 
RDP sequences, n = 870) were clustered into four independent 
groups, whose cohesive phylogenetic relationships to known 
archaeal phyla were not supported by high bootstrap scores 
(bootstrap scores <50%) in either NJ or ML trees. Three groups of 
these N-RDP sequences corresponded to previously reported 
archaeal groups: DSAG, MCG, and THSCG (terrestrial hot 
spring crenarchaeotic group), whereas the remaining N-RDP 
group was arbitrarily dc'signatcd as 'unclassified group' (UG) in 
this study. These UG sequences had also been designated as 
unclassified archaea in previous studies [42,45,48,69,70]. In 
addition to phylogenetic tree topologies, the inter-group genetic 
distances between MG-I and the four N-RDP sequence groups 
(0.283±0.028) were as great as those between the recognized 
archaeal phyla Crenarchaeota, Euryarchaeota, Manoarchaeota, and 
Korarchaeota (inter-phylum genetic distance, 0.297 ±0.023) 
(Table 2). On the other hand, inter-group genetic distances 
between MG-I and its close sister groups (FSCG, HWCG-III, RC, 
SAGMCG-I, SCG, and UTs) (0.165±0.035) were significantly 
smaller (a = 0.05, i-test) than the average inter-phylum genetic 
distance, thus supporting our initial hypothesis that these groups 
belong to the phylum Thaumarchaeota. Consequentiy, we classified 



MG-I, FSCG, HWCG-III, RC, SAGMCG-I, SCG, and UTs into 
Thaumarchaeota in our local database and considered DSAG, MCG, 
THSCG, and UG to be distinct phylum-level groups, not affiliated 
with any recognized archaeal phylum. Consistent with our results. 
Pester et al. [62] showed that Thaumarchaeota included the archaeal 
groups MG-I, SCG, HWCG-III, SAGMCG-I, and FSCG. Pester 
and colleagues also noted that DSAG and MCG are not clearly 
affifiated with any established archaeal phylum. In the final version 
of our local database, we noticed that sequences belonging to 
another recently proposed archaeal phylum, Aigarchaeota [71], were 
included in the phylum Crenarchaeota. No close relationships 
between aigarchaeotal sequences and T-RDP sequences were 
observed during our analysis. 

Using cluster analysis (cutoff level, 98% cophenetic similarity), 
thaumarchaeotal phylotypes were defined with 1,549 thaumarch- 
aeotal 16S rRNA gene serjuences (T-RDP sequences; average size, 
1376.7±54.3 bases) in the local database. In total, 114 phylotypes 
were observed (Table ?>), \\ith the majority (66.7%) belonging to 
the MG-I and SCG subgroups, which together contained most 
(93.9%) of the thaumarchaeotal sequences in our local database. 
No plateaus were observed on Thaumarchaeota rarefaction curves 
(Fig. 3), suggesting that, on a global scale, sequence sampling is still 
inadequate for accurately estimating the extent of diversity within 
the phylum Thaumarchaeota. However, the numbers of sequences 
per phylotype within the MG-I and SCG subgroups were much 
larger than those for other subgroups (Table 3), indicating that the 
majority of the sequences currentiy appended to these two groups 
actually belong to previously sampled phylotypes. 

Design and In silico Evaluation of the Thaumarchaeota- 
directed Primer 

Taking into consideration the specificity for its intended target 
sequences, as well as its th("rmodynamic propensities to form self- 
complementary structures (such as hair-pins and primer-dimers), 
we developed a Thaumarchaeota-directed primer, henceforth 
referred to as THAUM-494, from the consensus sequence of all 
thaumarchaeotal phylotype sequences (Table 3; Table S3). The 
specificity of THAUM-494, defined in terms of its coverage and 
tolerance, was evaluated in silico with our local database. We 
defined primer coverage as the extent to which the primer binds its 
target group sequences, tolerance as the extent to which the 
primer binds non-target sequences; both of these parameters for 
THAUM-494 were compared ^\'ith those of pre\ i()iisly designed 
primers (or probes) targc-ting sequences belonging to MG-I or 
mesophilic Crenarchaeota (Table 4 and 5; Table S4). THAUM-494 
showed >90"/a coverage for Thaumarchaeota and <1% tolerance to 
non-target taxa, indicating high specificity for the phylum 
Thaumarchaeota. AU non-target sequences (n = 8) with regions 
complementary to THAUM-494 belonged to the class Thermoprotei 
of the phylum Crenarchaeota. Further examination of the non-target 
sequences binding THAUM-494, using our local Bayesian 
classifier, revealed that they were all highly related to the 
thaumarchaeotal subgroups SCG or HWCG-III. THAUM-494 
covered the major subgroups MG-I, SCG, and SAGMCG-I at a 
rate of >90% each. THAUM-494 did not bind to sequences in 
FSCG and RC, two minor subgroups comprising only 1.5% of all 
thaumarchaeotal sequences. 

In addition to THAUM-494, we also used our local database to 
evaluate the specificity of previously designed primers (probes) 
targeting Thaumarchaeota-relMed taxa (Table 5; Table S4). The first 
oligonucleotide sequence specifically designed to target one of the 
thaumarchaeotal subgroups was the probe GI-554 [72], developed 
for MG-I (crenarchaeal group I in original paper) in 1997. 
Although the number of available MG-I sequences was limited at 



PLOS ONE I www.plosone.org 



15 



May 2014 I Volume 9 | Issue 5 | e96197 



Thaumarchaeota-Dkected PCR Primer 



the time, GI-554 showed satisfactory coverage (92.1%) for MG-I 
and extremely low (0.2%) non-target binding in our in silico 
analysis. However, as intended, GI-554 bound only to MG-I 
sequences and did not bind to sequences in other thaumarchaeotal 
subgroups. Several years later, three primer pairs, 771F-957R 
[7,33,35,73], GI751F-GI956R [31,74,75], and MCGI391f- 
MCGI554r (identical to MGI391 and Cren537) [30,36,76], were 
developed to amplify the 16S rRNA gene sequences of terrestrial 
crenarchaeota, nitrifying archaea, and MG-I, respectively. Our 
in silico results showed that 77 IF had relatively high coverage 
(94.6%) for nimman/meota, hut also had high tolerance to 
Crenarchaeota (10.9%), as well as other non-target groups (> 
28.0%). The reverse partner of 77 IF, 957R, covered only SCG 
sequences (96.1%) and exhibited low tolerance (<1%) to non- 
thaumarchaeotal sequences. Similarly, GI-751F bound 41.8% of 
all MG-I sequences, but did not cover other thaumarchaeotal 
subgroups. GI-956R showed high coverage (95.4%) for Thau- 
marchaeota, but was also highly tolerant to Crenarchaeota (tolerance, 
51.0%). Primer MCGI391f (MGI391) showed a specificity simHar 
to the primer GI-751F, with slighdy increased coverage (50.4%) 
for MG-I. The coverage and tolerance of MCGI554r (Cren537) 
were very similar to those of GI-554. Primers 542F, CREN512, 
Cren745a, and Cren518, originally designed to target the 
mesophUic group of Crenarchaeota when the current thaumarch- 
aeotal subgroups (MG-I, SCG, and FSCG) were considered to 
belong to Crenarchaeota [77-80], had high coverage not only for 
Crenarchaeota, but also for Thaumarchaeota. In summary, the 
prc\-i()usly designed primers with high coverage for the phylum 
Thaumarchaeota showed high tolerance to non-thaumarchaeotal 
taxa, and the- primers with low tolerance to non-thaumarchaeotal 
taxa showed low coverage for Thaumarchaeota. 

Empirical Evaluation of the Thaumarchaeota-dwecXed 
Primer 

To empirically evaluate the specificity of THAUM-494, we 
constructed clone libraries using PCR products obtained with 
TIIAUM-494 (Table 6). For a reverse primer, one of universal 
primers, ARC917R [60] or 1017FAR [59] was used, and 
environmental DNA extracted from soil and marine samples was 
used as a PCR template. The in silico coverage estimates of the 
universal primers ARC917R and 1017FAR for Thaumarchaeota 
were 96.1% and 82.2%, respectively (Table 7 and Table 8). 
Among the primer pair combinations (THAUM-494 and univer- 
sal primer) that generated PCR amplicons >400 bp in size, 
primer pair THAUM-494-ARC917R demonstrated the highest 
coverage (89.3%) for Thaumarchaeota (Table 8). Five clone libraries 
were constructed for each primer pair. Phylogenetic positions of 
cloned sequences were determined from phylogenetic trees built 
with reference sequences (Fig. S2-S6) and were confirmed with 
our local Bayesian classifier, developed with the local database 
sequences. 

Both primer pairs showed satisfactory specificity under exper- 
imental conditions, as predicted by our in silico analysis. Phyloge- 
netic analyses of 436 and 423 clones from libraries constructed 
with THAUM-494-ARC917R and THAUM-494-1017FA 
showed that 95.5±3.7% and 96.9±6.4%, respectively, belonged 
to the phylum Thaumarchaeota (Table 6). The thaumarchaeotal 
subgroups most frerjuently sampled were SCG, MG-I, and 
SAGMCG-I. THAUM-494 showed very low non-target binding 
under the PCR conditions used in this study. The majority of non- 
target sequences amplified with both primer pairs belonged to the 
MCG subgroup. Interestingly, a considerable number of unclas- 
sified thaumarchaeotal sequences (UT sequences, 23-28%), whose 
phylogenetic positions within Thaumarchaeota were not precisely 



defined by our thaumarchaeotal subgrouping scheme, were 
observed in the HiUgate woodland soil sample and an Incheon 
paddy soil. These results suggest that Thaumarchaeota might contain 
additional as-yet-undiscovered diversity, which can be further 
explored with THAUM-494 in future work. 

The proportions of thaumarchaeotal subgroups in each clone 
hbrary varied sKghdy with the universal primer used. For example, 
MG-I sequences were not detected in the KeUeberrin sample 
when THAUM-494 was paired with 1017FAR; in contrast, MG-I 
sequences comprised 1 1 % of the total amplified sequences when 
THAUM-494 was paired with ARC917R. We attributed such 
variations in subgroup detection to co\'erage differences between 
universal primers for each thaumarchaeotal subgroup. Our in silico 
analysis (Table 8) indicated that 1017FAR showed lower coverage 
(75.6'l'i,) for MG-I than ARC917R (96.0%). Hence, it is very 
important to pair THAUM-494 with the appropriate universal 
primer for unbiased sampling of Thaumarchaeota diversity. Although 
we experimentally tested only two universal primers to determine 
whether the choice of universal primer affects post-PCR sequence 
analysis, the coverages and tolerances of other well-known 
archaeal universal primers were estimated throughout the archaeal 
taxa by in silico analyses (Table 7 and Table 8; Table S5-S7). 
Thus, the results presented here will serve as a guideline for the 
selection of appropriate primer pairs for researchers to use in their 
particular appUcations. 

Concluding Remarks 

Our knowledge of the phylum Tliaumarehaeota, particularly 
regarding its ecological niche and diversity, is expanding rapidly, 
but is still limited. Since Thaumarchaeota are globally distributed and 
abundant [2,62], these archaea likely play a crucial role in 
sustaining species diversity as well as maintaining geochemical 
cycles. Physiological, molecular, and ecological surveys have been 
undertaken to better understand this phylum. As a result of such 
efforts, a larg(; number of thaumarchaeotal 16S rRNA gene 
sequences have been deposited in public databases. However, our 
rarefaction analyses indicate that the richness of thaumarchaeotal 
phylotypes has not yet reached its plateau, indicating that this 
phylum may have a much wider phylogenetic breadth than 
currently estimated. To facilitate comprehensive exploration of the 
diversity and ecological role of Thaumarchaeota, we developed 
THAUM-494, the first phylum-level primer for Tliaumarehaeota to 
the best of our knowledge. The high coverage and low tolerance of 
THAUM-494 make it especially useful for estimating phylogenetic 
diversity and determining the distribution patterns of Thnwnarch- 
aeota (e.g., high-throughput metagenome sequencing and real-time 
PCR assays). Furthermore, this primer will be a valuable tool for 
understanding the ecological niche of Thaumarchaeota. 

Supporting Information 

Figure SI Backbone phylogenetic tree. The phylogenetic 
distances of each sequence were calculated using the Jukes-Cantor 

model, and the tree was constructed using the neighbor-joining 
algorithm. The numbers at the nodes indicates the bootstrap score 
(as a percentage) and are shown for the frequencies at or above the 
threshold of 50%. The scale bar represents the expected number 
of substitutions per nucleotide position. 
(PDF) 

Figure S2 Phylogenetic positions of cloned sequences. 

Cloned sequences recovered from Kellerberrin, Australia. A, 
primer pairs THAUM-494-ARC917R; B, primer pairs THAUM- 
494- 101 7R. The phylogenetic distances of each sequence were 
calculated using the Jukes-Cantor model, and the tree was 
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constructed using the neighbor-joining algorithm. The numbers at 
the nodes indicates the bootstrap score (as a percentage) and are 
shown for the frequencies at or above the threshold of 50%. The 
scale bar represents the expected number of substitutions per 
nucleotide position. 
(PDF) 

Figure S3 Phylogenetic positions of cloned sequences. 

Cloned sequences recovered from Hillgate, California. A, primer 
pairs THAUM-494-ARC917R; B, primer pairs THAUM-494- 
101 7R. The phylogenetic distances of each sequence were 
calculated using the Jukes-Cantor model, and the tree was 
constructed using the neighbor-joining algorithm. The numbers 
at the nodes indicates the bootstrap score (as a percentage) and are 
shown for the frequencies at or above the threshold of 50%. The 
scale bar represents the expected number of substitutions per 
nucleotide position. 
(PDF) 

Figure S4 Phylogenetic positions of cloned sequences. 

Cloned sequences recovered from La Campana, Chile. A, primer 
pairs THAUM-494-ARC917R; B, primer pairs THAUM-494- 
101 7R. The phylogenetic distances of each sequence were 
calculated using the Jukes-Cantor model, and the tree was 
constructed using the neighbor-joining algorithm. The numbers 
at the nodes indicates the bootstrap score (as a percentage) and are 
shown for the frequencies at or above the threshold of 50%. The 
scale bar represents the expected number of substitutions per 
nucleotide position. 
(PDF) 

Figure S5 Phylogenetic positions of cloned sequences. 

Cloned sequences recovered from Incheon, Korea. A, primer pairs 
THAUM-494-ARC917R; B, primer pairs THAUM-494-1017R. 
The phylogenetic distances of each sequence were calculated using 
the Jukes-Cantor model, and the tree was constructed using the 
neighbor-joining algorithm. The numbers at the nodes indicates 
the bootstrap score (as a percentage) and are shown for the 
frequencies at or above the threshold of 50%. The scale bar 
represents the expected number of substitutions per nucleotide 
position. 
(PDF) 

Figure S6 Phylogenetic positions of cloned sequences. 

Cloned sequences recovered from Jeju, Korea. A, primer pairs 
THAUM-494-ARC917R; B, primer pairs THAUM-494-1017R. 
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